Support for --default-name and --default-output in create command by jgalar · Pull Request #3 · PSRCode/lttng-tools

jgalar · 2016-06-15T21:07:26Z

No description provided.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

There is no need for 4 bytes to express a boolean state. Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

validate_command: checks for invalid options configuration create_session_basic: responsible for session creation and configuration generate_output: responsible of outputting the info to the user Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Found by Coverity: CID 1243017 (#1 of 4): Buffer not null terminated (BUFFER_SIZE_WARNING)14. buffer_size_warning: Calling strncpy with a maximum size argument of 264 bytes on destination array msg.channel_name of size 264 bytes might leave the destination string unterminated. ID 1243017 (#2 of 4): Buffer not null terminated (BUFFER_SIZE_WARNING)14. buffer_size_warning: Calling strncpy with a maximum size argument of 264 bytes on destination array msg_2_2.channel_name of size 264 bytes might leave the destination string unterminated. CID 1243017 (#3 of 4): Buffer not null terminated (BUFFER_SIZE_WARNING)15. buffer_size_warning: Calling strncpy with a maximum size argument of 4096 bytes on destination array msg.pathname of size 4096 bytes might leave the destination string unterminated. CID 1243017 (#4 of 4): Buffer not null terminated (BUFFER_SIZE_WARNING)15. buffer_size_warning: Calling strncpy with a maximum size argument of 4096 bytes on destination array msg_2_2.pathname of size 4096 bytes might leave the destination string unterminated. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Background ========== I have a file named "/a" on my file system (don't ask why). Issue ===== While running the `test_utils_expand_path` test case on my machine, I get a Segfault. Here is the gdb backtrace: #0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:62 #1 0x0000555555559eb7 in expand_double_slashes_dot_and_dotdot (path=0x0) at utils.c:223 #2 0x000055555555a2d8 in _utils_expand_path (path=0x55555556b250 "/a/b/c/d/e", keep_symlink=true) at utils.c:384 #3 0x000055555555a408 in utils_expand_path (path=0x55555556b250 "/a/b/c/d/e") at utils.c:423 #4 0x000055555555859e in test_utils_expand_path () at test_utils_expand_path.c:291 lttng#5 0x00005555555589b0 in main (argc=1, argv=0x7fffffffe5e8) at test_utils_expand_path.c:352 I get this backtrace because the function `utils_partial_realpath()` returns NULL when it tries to expand the "/a/b/c/d/e" path and realize that it could not exist since "/a" is a file and not a directory. Anyways, the returned NULL pointer is ignored and directly used in the `expand_double_slashes_dot_and_dotdot()` function right after. This configuration ("/a" being a file) is expected to fail but not to segfault. It could be reproduce in a real scenario when creating directory structures. Solution ======== Return an error if `utils_partial_realpath()` returns NULL. Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Observed issue ============== As reported in #1323 (https://bugs.lttng.org/issues/1323), crashes of the relay daemon are observed when running the user space clear tests. The crash occurs with the following stack trace: #0 0x000055fbb861d6ae in urcu_ref_get_unless_zero (ref=0x28) at /usr/local/include/urcu/ref.h:85 #1 lttng_trace_chunk_get (chunk=0x0) at trace-chunk.c:1836 #2 0x000055fbb86051e2 in make_viewer_streams (relay_session=relay_session@entry=0x7f6ea002d540, viewer_session=<optimized out>, seek_t=seek_t@entry=LTTNG_VIEWER_SEEK_BEGINNING, nb_total=nb_total@entry=0x7f6ea9607b00, nb_unsent=nb_unsent@entry=0x7f6ea9607aec, nb_created=nb_created@entry=0x7f6ea9607ae8, closed=<optimized out>) at live.c:405 #3 0x000055fbb86061d9 in viewer_get_new_streams (conn=0x7f6e94000fc0) at live.c:1155 #4 process_control (conn=0x7f6e94000fc0, recv_hdr=0x7f6ea9607af0) at live.c:2353 lttng#5 thread_worker (data=<optimized out>) at live.c:2515 lttng#6 0x00007f6eae86a609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 lttng#7 0x00007f6eae78f293 in clone () from /lib/x86_64-linux-gnu/libc.so.6 The race window during which this occurs seems very small as it can take hours to reproduce this crash. However, a minimal reproducer could be identified, as stated in the bug report. Essentially, the same crash can be reproduced by attaching a live viewer to a session that has seen events being produced, been stopped and been cleared. Cause ===== The crash occurs as an attempt is made to take a reference to a viewer session’s trace chunk as viewer streams are created. The crux of the problem is that the code doesn’t expect a viewer session’s trace chunk to be NULL. The viewer session’s current trace chunk is initially set, when a viewer attaches to the viewer session, to a copy the corresponding relay_session’s current trace chunk. A live session always attempts to "catch-up" to the newest available trace chunk. This means that when a viewer reaches the end of a trace chunk, the viewer session may not transition to the "next" one: it jumps to the most recent trace chunk available (the one being produced by the relay_session). Hence, if the producer performs multiple rotations before a viewer completes the consumption of a trace chunk, it will skip over those "intermediary" trace chunks. A viewer session updates its current trace chunk when: 1) new viewer streams are created, 2) a new index is requested, 3) metadata is requested. Hence, as a general principle, the viewer session will reference the most recent trace chunk available _even if its streams do not point to it_. It indicates which trace chunk viewer streams should transition to when the end of their current trace chunk is reached. The live code properly handles transitions to a null chunk. This can be verified by attaching a viewer to a live session, stopping the session, clearing it (thus entering a null trace chunk), and resuming tracing. The only issue is that the case where the first trace chunk of a viewer session is "null" (no active trace chunk) is mishandled in two places: 1) in make_viewer_streams(), where the crash is observed, 2) in viewer_get_metadata(). Solution ======== In make_viewer_streams(), it is assumed that a viewer session will have a non-null trace chunk whenever a rotation is not ongoing. This is reflected by the fact that a reference is always acquired on the viewer session’s trace chunk. That code is one of the three places that can cause a viewer session’s trace chunk to be updated. We still want to update the viewer session to the most recently seen trace chunk (null, in this case). However, there is no reference to acquire and the trace chunk to use for the creation of the viewer stream is NULL. This is properly handled by viewer_stream_create(). The second site to change is viewer_get_metadata() which doesn’t handle a viewer metadata stream not having an active trace chunk at all. Thankfully, the protocol allows us to express this condition by returning the LTTNG_VIEWER_NO_NEW_METADATA status code when a viewer metadata stream doesn’t have an open file and doesn’t have a current trace chunk. Surprisingly, this bug didn’t trigger in the case where a transition to a null chunk occurred _after_ attaching to a viewer session. This is because viewers will typically ask for metadata as a result of an LTTNG_VIEWER_FLAG_NEW_METADATA reply to the GET_NEXT_INDEX command. When a session is stopped and all data was consumed, this command returns that no new data is available, causing the viewers to wait and ask again later. However, when attaching, babeltrace2 (at least, and probably babeltrace 1.x) always asks for an initial segment of metadata before asking for an index. Known drawbacks =============== None. Fixes: #1323 Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: I516fca60755e6897f6b7170c12d706ef57ad61a5

Observed issue ============== When performing: #!/bin/bash lttng create py_syscalls --live lttng enable-event -u -a lttng enable-event -k -a lttng start babeltrace2 -i lttng-live net://localhost/host/raton/py_syscalls The relay daemon hits this assertion: Thread 8 (Thread 0x7fffeeffd700 (LWP 167040) "lttng-relayd"): #0 0x00007ffff7b1618b in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007ffff7af5859 in abort () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x00007ffff7af5729 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #3 0x00007ffff7b06f36 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6 #4 0x00005555555889bb in viewer_session_attach (vsession=0x7fffdc001400, session=session@entry=0x7fffe8001180) at viewer-session.c:80 lttng#5 0x000055555557bcff in viewer_attach_session (conn=0x7fffd0001140) at live.c:1275 lttng#6 process_control (conn=0x7fffd0001140, recv_hdr=0x7fffeeffcaf0) at live.c:2341 lttng#7 thread_worker (data=<optimized out>) at live.c:2515 lttng#8 0x00007ffff7ccd609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 lttng#9 0x00007ffff7bf2293 in clone () from /lib/x86_64-linux-gnu/libc.so.6 Cause ===== This assert appears to be entirely wrong. It checks that the "viewer session" has a NULL current trace chunk when attaching a session to a viewer session, but in the case where a viewer session has multiple sessions (e.g. with kernel and ust tracing combined), we are attaching each session individually to the viewer session, and we set the current trace chunk of the viewer session when we attach the first session to it. So it is expected to be non-NULL when attaching the second session. Solution ======== Remove the assertion. Known limitations ================= None. Fixes: #1335 Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: I4d8f5d5347b4588144ddf449976cae5a94b81b3a

…y_name Observed issue ============== The lttng-sessiond asserts with the following backtrace on lttng create: #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007ffff7ab5859 in __GI_abort () at abort.c:79 #2 0x00007ffff7ab5729 in __assert_fail_base (fmt=0x7ffff7c4b588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5555556ab0a6 "node_ptr == &node->node", file=0x5555556ab085 "hashtable.c", line=298, function=<optimized out>) at a #3 0x00007ffff7ac6f36 in __GI___assert_fail (assertion=assertion@entry=0x5555556ab0a6 "node_ptr == &node->node", file=file@entry=0x5555556ab085 "hashtable.c", line=line@entry=298, function=function@entry=0x5555556ab380 <__PRETTY_FUNCTIO #4 0x000055555560be44 in lttng_ht_add_unique_str (ht=<optimized out>, node=0x7fffe0026c58) at hashtable.c:298 lttng#5 0x000055555558fb6a in add_session_ht (ls=0x7fffe0024970) at session.c:372 lttng#6 session_create (name=<optimized out>, uid=1000, gid=1000, out_session=out_session@entry=0x7fffedfddbd8) at session.c:1308 lttng#7 0x000055555559b219 in cmd_create_session_from_descriptor (creds=<optimized out>, creds=<optimized out>, home_path=<optimized out>, descriptor=<optimized out>) at cmd.c:3040 lttng#8 cmd_create_session (cmd_ctx=cmd_ctx@entry=0x7fffedfe5fa0, sock=<optimized out>, return_descriptor=return_descriptor@entry=0x7fffedfddd68) at cmd.c:3176 lttng#9 0x00005555555cc341 in process_client_msg (sock_error=0x7fffedfddd10, sock=0x7fffedfddd0c, cmd_ctx=0x7fffedfe5fa0) at client.c:2177 lttng#10 thread_manage_clients (data=<optimized out>) at client.c:2742 lttng#11 0x00005555555c5fff in launch_thread (data=0x55555571b780) at thread.c:66 lttng#12 0x00007ffff7c8b609 in start_thread (arg=<optimized out>) at pthread_create.c:477 lttng#13 0x00007ffff7bb2293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 The issue can be reproduced with modifications to the rotation thread code and the following scenario: $ lttng create test $ lttng enable-event -u -a $ lttng start run any app just so that we have a complete valid session. (might not be necessary) $ lttng destroy --no-wait $ lttng create test ^ Should assert here. The diff to be applied: diff --git a/src/bin/lttng-sessiond/rotation-thread.cpp b/src/bin/lttng-sessiond/rotation-thread.cpp index ac149c8..c11f068ed 100644 --- a/src/bin/lttng-sessiond/rotation-thread.cpp +++ b/src/bin/lttng-sessiond/rotation-thread.cpp @@ -565,6 +565,8 @@ int handle_job_queue(struct rotation_thread_handle *handle, { int ret = 0; + sleep(5); + for (;;) { struct ltt_session *session; struct rotation_thread_job *job; Note that the initial report for this issue was on a system under load for which the `lttng destroy` completion check failed and a `lttng create` was performed. As of today the exact reason why the completion check failed is not known. Still we can "fix" the race leading to the lttng-sessiond assertion considering a user might use the `--no-wait` variant of `lttng destroy` and could easily end up in this situation. Cause ===== Note: all `lttng create` commands have the same session name passed as argument. On `lttng destroy` the ltt_session object is flagged as destroyed (ltt_session::destroyed). The removal of the object from the hash table (ltt_sessions_ht_by_name) will be performed during the `session_release` which is driven by the session refcount. A reference on the `ltt_session` object is held for the rotation initiated by the `lttng destroy` command. The rotation is enqueued by the rotation thread. At this point the system is busy and the rotation thread does not run. We simulate this with a `sleep(5)` during the `handle_job_queue`. The `lttng destroy --no-wait` returns. If the `--no-wait` option is not passed the `lttng destroy` command will work as expected and wait for completion. We can SIGINT the `lttng destroy` command and perform a `lttng create` yielding the same backtrace. On `lttng create`, `session_create` validates that the name does not conflict with an existing session using `session_find_by_name`. It is important to note that `session_find_by_name` discriminates also on the `session->destroyed` flag (introduced by [1]). The `ltt_sessions_ht_by_name` hash table was introduced by [2] to remove the need to lock the session list to sample a session id during the queueing of actions to be executed related to a trigger. The assumption was made that, during the creation phase, the session would always be unique in that hash table based on its name. This is simply not true since multiple sessions with the same name can coexist as long as only a single one is marked as "not destroyed". This is an important concept due to the refcounting of the object and the feature relying on the lifetime of the object (i.e rotation). This is mostly valid when talking about the global session list. Solution ======== Move the hash table removal earlier during the release of the session object. Move the removal from `del_session_ht`, which is done during the `session_release` function, to the `lttng_session_destroy` function. It is safe to do so since currently the only user of that hash table (the action executor) does not care much about destroyed session at that point. This ensures that we maintain the uniqueness property of the key (name) for that hash table on insertion. The alternative was to expose an hash table that could contain duplicates and force the handling of a set on all lookups. Known drawbacks ========= None. References ========== [1] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=e32d7f274604b77bcd83c24994e88df3761ed658 [2] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=e1bbf98908a6399f39a9a8bc95bd8b59cecaa816 Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: I2f1d0d6c04ee7210166e9847a850afbe6eaa7609

Observed issue ============== lttng-consumerd segfaults for the following scenario: $ lttng create test --snapshot $ lttng enable-event -u -an $ lttng start # Run an app just to have some events $ lttng regenerate metadata $ lttng snapshot record The following backtrace is obtained: (gdb) bt #0 __GI___pthread_mutex_lock (mutex=0x130) at ../nptl/pthread_mutex_lock.c:67 #1 0x000055b383cfaed3 in lttng_ustconsumer_recv_metadata (sock=29, key=4, offset=0, len=12391, version=1, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:1347 #2 0x000055b383d00197 in lttng_ustconsumer_request_metadata (ctx=0x55b3855a1e50, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:3336 #3 0x000055b383cf9e76 in snapshot_metadata (metadata_channel=0x7fe90000a760, key=4, path=0x7fe911a09944 "uid/1000/64-bit", relayd_id=18446744073709551615, ctx=0x55b3855a1e50) at ust-consum #4 0x000055b383cfbe73 in lttng_ustconsumer_recv_cmd (ctx=0x55b3855a1e50, sock=28, consumer_sockpoll=0x7fe911a0dbb0) at ust-consumer.c:1853 lttng#5 0x000055b383ccf9b7 in lttng_consumer_recv_cmd (ctx=0x55b3855a1e50, sock=28, consumer_sockpoll=0x7fe911a0dbb0) at consumer.c:2097 lttng#6 0x000055b383cd3bfd in consumer_thread_sessiond_poll (data=0x55b3855a1e50) at consumer.c:3284 lttng#7 0x00007fe914c22609 in start_thread (arg=<optimized out>) at pthread_create.c:477 lttng#8 0x00007fe914b47293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 (gdb) up #1 0x000055b383cfaed3 in lttng_ustconsumer_recv_metadata (sock=29, key=4, offset=0, len=12391, version=1, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:1347 1347 pthread_mutex_lock(&channel->metadata_stream->lock); (gdb) print channel->metadata_stream $1 = (struct lttng_consumer_stream *) 0x0 Note that the following scenario also leads to a similar backtrace: $ lttng create test --snapshot $ lttng enable-event -u -an $ lttng start # Run an app just to have some events with a short duration $ lttng regenerate metadata # Run a second app just to have some events and a metadata flush while # the metadata cache status is set as `invalidated`. ^ lttng-consumerd segfault on app termination. The backtrace: (gdb) bt #0 __GI___pthread_mutex_lock (mutex=0x130) at ../nptl/pthread_mutex_lock.c:67 #1 0x00005612a5a13ed3 in lttng_ustconsumer_recv_metadata (sock=28, key=2, offset=0, len=12391, version=1, channel=0x7f3978005400, timer=0, wait=1) at ust-consumer.c:1347 #2 0x00005612a5a14d0a in lttng_ustconsumer_recv_cmd (ctx=0x5612a6feee50, sock=28, consumer_sockpoll=0x7f3989494bb0) at ust-consumer.c:1818 #3 0x00005612a59e89b7 in lttng_consumer_recv_cmd (ctx=0x5612a6feee50, sock=28, consumer_sockpoll=0x7f3989494bb0) at consumer.c:2097 #4 0x00005612a59ecbfd in consumer_thread_sessiond_poll (data=0x5612a6feee50) at consumer.c:3284 lttng#5 0x00007f398c6a9609 in start_thread (arg=<optimized out>) at pthread_create.c:477 lttng#6 0x00007f398c5ce293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 (gdb) up #1 0x00005612a5a13ed3 in lttng_ustconsumer_recv_metadata (sock=28, key=2, offset=0, len=12391, version=1, channel=0x7f3978005400, timer=0, wait=1) at ust-consumer.c:1347 1347 pthread_mutex_lock(&channel->metadata_stream->lock); (gdb) print channel->metadata_stream $1 = (struct lttng_consumer_stream *) 0x0 (gdb) Cause ===== For session configured in snapshot mode the metadata channel metadata_stream field is NULL except for a "short" window during the actual snapshot record action (snapshot_metadata). The `regenerate metadata` effectively flag the metadata cache as invalid leading to handling the cache invalidation state (`CONSUMER_METADATA_CACHE_WRITE_STATUS_INVALIDATED`) in `lttng_ustconsumer_recv_metadata`. This was introduced by b1316da [1]. At that point the function expects the `channel->metadata_stream` to be non null. This is simply not true for snapshot session metadata channels. Note that we cannot simply swap `lttng_ustconsumer_request_metadata` and `create_ust_streams` in `snapshot_metadata` to ensure that the `channel->metadata_stream` is non null since `lttng_ustconsumer_recv_metadata` ends up being called on metadata flush when an app quit. This sequence of events corresponds to the second scenario put forward in the `Observed Issue` section. Solution ======== Null check `channel->metadata_stream` and perform only the operation when it is non null. This partly mirror what is done in `consumer_metadata_wakeup_pipe`. I am not sure if the check on `channel->monitor` is required but it seems irrelevant to the notion of resetting the stream consumed position when the stream exists. With this taken care off, we find ourself with another backtrace: #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 [93/122] #1 0x00007f75cf9b3859 in __GI_abort () at abort.c:79 #2 0x00007f75cf9b3729 in __assert_fail_base (fmt=0x7f75cfb49588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55ab004e9c68 "pthread_mutex_trylock(&stream->lock)", file=0x55ab004 #3 0x00007f75cf9c4f36 in __GI___assert_fail (assertion=0x55ab004e9c68 "pthread_mutex_trylock(&stream->lock)", file=0x55ab004e8d7a "ust-consumer.c", line=1285, function=0x55ab004eb8a0 <__PR #4 0x000055ab004b1b9c in metadata_stream_reset_cache_consumed_position (stream=0x7f75b400a850) at ust-consumer.c:1285 lttng#5 0x000055ab004b4fef in commit_one_metadata_packet (stream=0x7f75b400a850) at ust-consumer.c:2551 lttng#6 0x000055ab004b5f46 in get_next_subbuffer_metadata (stream=0x7f75b400a850, subbuffer=0x7f75cc972630) at ust-consumer.c:2927 lttng#7 0x000055ab0048b6a9 in lttng_consumer_read_subbuffer (stream=0x7f75b400a850, ctx=0x55ab01d4ee50, locked_by_caller=true) at consumer.c:3522 lttng#8 0x000055ab004b0f66 in snapshot_metadata (metadata_channel=0x7f75b4005400, key=2, path=0x7f75cc972944 "uid/1000/64-bit", relayd_id=18446744073709551615, ctx=0x55ab01d4ee50) at ust-consum lttng#9 0x000055ab004b2e86 in lttng_ustconsumer_recv_cmd (ctx=0x55ab01d4ee50, sock=28, consumer_sockpoll=0x7f75cc976bb0) at ust-consumer.c:1861 lttng#10 0x000055ab004869b7 in lttng_consumer_recv_cmd (ctx=0x55ab01d4ee50, sock=28, consumer_sockpoll=0x7f75cc976bb0) at consumer.c:2097 lttng#11 0x000055ab0048abfd in consumer_thread_sessiond_poll (data=0x55ab01d4ee50) at consumer.c:3284 lttng#12 0x00007f75cfb8b609 in start_thread (arg=<optimized out>) at pthread_create.c:477 lttng#13 0x00007f75cfab0293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Which is also caused in part to the invalidation of the cache. `metadata_stream_reset_cache_consumed_position` expect that the stream at that point be locked. Which is not the case despite what the last argument to `lttng_consumer_read_subbuffer` indicates. To alleviate that we hold the lock during the call to `lttng_consumer_read_subbuffer`. Known drawbacks ========= None. References ========= [1] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=b1316da1ffbd276fc8271e7a9438e683ad352781 Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: Ib09fdc989b1baae26d0f496afeabc3e7cb27228c

Observed issue ============== While running tests/regression/tools/clear/test_ust test in a loop we eventually witness the following error: The symptom on the Babeltrace side is a Connection reset by peer. This is caused by a relayd abort after an assertion failure due to a reference count being lower than 0. Test case: # Test ust streaming live clear with viewer # Parameters: tracing_active=0, clear_twice=0, buffer_type=pid Tools commit: https://review.lttng.org/c/lttng-tools/+/6754/1 Relay daemon assertion failure stack trace: (gdb) bt #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007ffa2e850859 in __GI_abort () at abort.c:79 #2 0x00007ffa2e850729 in __assert_fail_base (fmt=0x7ffa2e9e6588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x56221bdb96f4 "res >= 0", file=0x56221bdb97a0 "/root/virtenv/usr/include/urcu/ref.h", line=66, function=<optimized out>) at assert.c:92 #3 0x00007ffa2e861f36 in __GI___assert_fail (assertion=assertion@entry=0x56221bdb96f4 "res >= 0", file=file@entry=0x56221bdb97a0 "/root/virtenv/usr/include/urcu/ref.h", line=line@entry=66, function=function@entry=0x56221bdbd638 <__PRETTY_FUNCTION__.7554> "urcu_ref_put") at assert.c:101 #4 0x000056221bd6a1cc in urcu_ref_put (release=<optimized out>, ref=0x7ffa24008cb0) at /root/virtenv/usr/include/urcu/ref.h:66 lttng#5 viewer_stream_put (vstream=vstream@entry=0x7ffa24008cb0) at viewer-stream.c:279 lttng#6 0x000056221bd5e4c5 in viewer_get_metadata (conn=conn@entry=0x7ffa0c000fc0) at live.c:2211 lttng#7 0x000056221bd63778 in process_control (conn=0x7ffa0c000fc0, recv_hdr=0x7ffa297c5af0) at live.c:2376 lttng#8 thread_worker (data=<optimized out>) at live.c:2541 lttng#9 0x00007ffa2ea28609 in start_thread (arg=<optimized out>) at pthread_create.c:477 lttng#10 0x00007ffa2e94d293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Babeltrace crash: 11-17 05:13:05.197 47007 47007 D PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_get_next_index@viewer-connection.c:1494 [live] Requesting next index for stream: cmd=GET_NEXT_INDEX, viewer-stream-id=3352 11-17 05:13:05.215 47007 47007 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_recv@viewer-connection.c:245 [live] Error receiving from Relay: Connection reset by peer. 11-17 05:13:05.215 47007 47007 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_get_next_index@viewer-connection.c:1522 [live] Error receiving get next index reply 11-17 05:13:05.215 47007 47007 D PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_iterator_next_msg_on_stream@lttng-live.c:1069 [live] Returning from advancing live stream iterator: status=LTTNG_LIVE_ITERATOR_STATUS_ERROR, stream-name="stream-3352", viewer-stream-id=3352 11-17 05:13:05.215 47007 47007 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1802 [live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR 11-17 05:13:05.216 47007 47007 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:866 Component input port message iterator's "next" method failed: iter-addr=0x563b7dd07700, iter-upstream-comp-name="live", iter-upstream-comp-log-level=TRACE, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUT PUT, iter-upstream-port-name="out", status=ERROR 11-17 05:13:05.216 47007 47007 E PLUGIN/FLT.UTILS.MUXER muxer_upstream_msg_iter_next@muxer.c:446 [mux] Upstream iterator's next method returned an error: status=ERROR 11-17 05:13:05.216 47007 47007 E PLUGIN/FLT.UTILS.MUXER validate_muxer_upstream_msg_iters@muxer.c:989 [mux] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x563b7dd04740, muxer-upstream-msg-iter-wrap-addr=0x563b7dd0db30 11-17 05:13:05.216 47007 47007 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1417 [mux] Cannot get next message: comp-addr=0x563b7dd1a760, muxer-comp-addr=0x563b7dd1b930, muxer-msg-iter-addr=0x563b7dd04740, msg-iter-addr=0x563b7dd07590, status=ERROR 11-17 05:13:05.216 47007 47007 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:866 Component input port message iterator's "next" method failed: iter-addr=0x563b7dd07590, iter-upstream-comp-name="mux", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR 11-17 05:13:05.216 47007 47007 W LIB/GRAPH consume_graph_sink@graph.c:462 Component's "consume" method failed: status=ERROR, comp-addr=0x563b7dd06d20, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x563b7dd051b0, comp-class-so-handle-path="/root/virtenv/usr/lib/babeltr ace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0 11-17 05:13:05.217 47007 47007 E CLI cmd_run@babeltrace2.c:2537 Graph failed to complete successfully Cause ===== Both relay and live threads can put the ownership reference on the metadata viewer stream concurrently without synchronization, thus leading to a reference count going lower than 0. The viewer stream ownership design initially planned for being owned by the live thread, thus allowing the live thread to put the ownership reference as soon as the associated relay stream is observed as closed, and the viewer stream is considered as hung up. However, in the specific case of the metadata viewer stream, the responsibility of closing the metadata viewer stream is shared between the relay and live threads, because the viewers expect to observe a LTTNG_VIEWER_NO_NEW_METADATA message before the metadata stream hangs up (see comment in viewer_get_metadata()). Therefore, if viewer_get_metadata() is done before the metadata stream is closed, the viewer will receive the LTTNG_VIEWER_NO_NEW_METADATA message, and set the no_new_metadata_notified state to true. It's then the relay thread's relay_close_stream() which will invoke the ownership put. However, the live thread may concurrently try to put the viewer stream ownership as well from a subsequent viewer_get_metadata(), thus leading to a reference count < 0. Solution ======== Fix this by putting the ownership reference from the live viewer thread rather than the relay thread. This can be done by tracking the state of no_new_metadata_notified within the live viewer thread. Known drawbacks =============== This will postpone reclaim of the metadata viewer stream from the relay stream close to the following viewer_get_metadata (after a LTTNG_VIEWER_NO_NEW_METADATA message has been replied), which I don't think is an issue. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: I9783d3a8402dddaa364a7a69c0764eaf30c16fde

When doing `lttng stop`, I get: Direct leak of 656 byte(s) in 1 object(s) allocated from: #0 0x7f970719e459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x7f9706eba29a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45 #2 0x7f9706ebbb9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494 #3 0x7f9706ebbf9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596 #4 0x7f9706eba714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58 lttng#5 0x7f9706eba747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78 lttng#6 0x7f9706ec4604 in lttng_list_channels /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2262 lttng#7 0x55837235c4e7 in get_session_stats_str /home/simark/src/lttng-tools/src/bin/lttng/utils.c:499 lttng#8 0x55837235bf73 in print_session_stats /home/simark/src/lttng-tools/src/bin/lttng/utils.c:445 lttng#9 0x55837231cc12 in stop_tracing /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:138 lttng#10 0x55837231d062 in cmd_stop /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:229 lttng#11 0x55837235e63e in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237 lttng#12 0x55837235f0f2 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426 lttng#13 0x55837235f3f4 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475 lttng#14 0x7f9706adcb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) Direct leak of 308 byte(s) in 1 object(s) allocated from: #0 0x7f970719e459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x7f9706eba29a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45 #2 0x7f9706ebbb9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494 #3 0x7f9706ebbf9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596 #4 0x7f9706eba714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58 lttng#5 0x7f9706eba747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78 lttng#6 0x7f9706ec421c in lttng_list_domains /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2220 lttng#7 0x55837235c3d3 in get_session_stats_str /home/simark/src/lttng-tools/src/bin/lttng/utils.c:484 lttng#8 0x55837235bf73 in print_session_stats /home/simark/src/lttng-tools/src/bin/lttng/utils.c:445 lttng#9 0x55837231cc12 in stop_tracing /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:138 lttng#10 0x55837231d062 in cmd_stop /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:229 lttng#11 0x55837235e63e in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237 lttng#12 0x55837235f0f2 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426 lttng#13 0x55837235f3f4 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475 lttng#14 0x7f9706adcb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) This is due to the get_session_stats_str function not free'ing the results of lttng_list_channels and lttng_list_domains. Fix that. Change-Id: I4c200d3df41bf09bdce8eadb000abbff7fe5a751 Signed-off-by: Simon Marchi <simon.marchi@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

When doing `lttng destroy`, I get: Direct leak of 4385 byte(s) in 1 object(s) allocated from: #0 0x7f74ae025459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x7f74add4129a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45 #2 0x7f74add42b9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494 #3 0x7f74add42f9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596 #4 0x7f74add41714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58 lttng#5 0x7f74add41747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78 lttng#6 0x7f74add4a922 in lttng_list_sessions /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2105 lttng#7 0x56472bcbdf80 in cmd_destroy /home/simark/src/lttng-tools/src/bin/lttng/commands/destroy.c:330 lttng#8 0x56472bd00764 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237 lttng#9 0x56472bd01218 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426 lttng#10 0x56472bd0151a in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475 lttng#11 0x7f74ad963b24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) This is due to cmd_destroy not free'ing the result of lttng_list_sessions. Fix that. Change-Id: Iff2e75e6ec1cdcd0bdfdbbc3d5099422e592905b Signed-off-by: Simon Marchi <simon.marchi@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Observed issue ============== The following scenario leads to an abort of lttng-sessiond. lttng-sessiond (with kernel tracing available) lttng create system-trace --snapshot -U /tmp/snapshot lttng enable-channel -k system-trace --subbuf-size=4k --num-subbuf=256 lttng enable-event -c system-trace -k 'sched_wak*' -s system-trace lttng start system-trace lttng enable-event -u -a Fails as expected with: Error: Events: The command tried to enable an event in a new domain for a session that has already been started once. (channel channel0, session system-trace) Launch any ust app such as easy_ust from the lttng-ust repository. The following backtrace is generated: (gdb) bt #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007ffff7af0859 in __GI_abort () at abort.c:79 #2 0x00007ffff7af0729 in __assert_fail_base (fmt=0x7ffff7c86588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55555564b765 "usess->active", file=0x555555649a60 "ust-app.c", line #3 0x00007ffff7b01f36 in __GI___assert_fail (assertion=0x55555564b765 "usess->active", file=0x555555649a60 "ust-app.c", line=5123, function=0x55555564ecf0 <__PRETTY_FUNCTION__.14199> "ust_ #4 0x00005555555d1f5e in ust_app_global_update (usess=0x7fffe001fb90, app=0x7fffac000b80) at ust-app.c:5123 lttng#5 0x00005555555b60d4 in update_ust_app (app_sock=82) at dispatch.c:71 lttng#6 0x00005555555b7025 in thread_dispatch_ust_registration (data=0x5555556a07f0) at dispatch.c:409 lttng#7 0x00005555555ad5ab in launch_thread (data=0x5555556a0810) at thread.c:65 lttng#8 0x00007ffff7ce6609 in start_thread (arg=<optimized out>) at pthread_create.c:477 lttng#9 0x00007ffff7bed293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 This also happens for the track command. You can replace the `lttng enable-event -u -a` with `lttng track --userspace --vuid=0` then launch an app and the same backtrace gets generated. Cause ===== During `process_client_msg` the `create_ust_session` function is called and a ust session is assigned to the "system_trace" session with a state of `active` set to 0 (false). This is not a problem. The problem seems to lie with a single call site for `ust_app_global_update` in `update_ust_app`. The status of the ust session is not checked before calling the `ust_app_global_update`. It is important to note that all `ust_app_global_update_all` callsites guard the call with a check against the status of the session. Solution ======== Guard the call to `ust_app_global_update` with a check of the ust session active state. Known drawbacks ========= None. Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: I14d25d99d0609689247cdfa86130bd0219613581

Observed issue ============== The following backtrace have been reported [1]. #0 __GI_raise (sig=sig@entry=6) at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/sysdeps/unix/sysv/linux/raise.c:50 #1 0x0000003123025528 in __GI_abort () at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/stdlib/abort.c:79 #2 0x0000000000419884 in lttng_trace_archive_location_serialize (location=0x7f1c9c001160, buffer=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/location.c:230 #3 0x00000000004c8f06 in lttng_evaluation_session_rotation_serialize (evaluation=0x7f1cb000a7f0, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/conditions/session-rotation.c:539 #4 0x00000000004a80fa in lttng_evaluation_serialize (evaluation=0x7f1cb000a7f0, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/evaluation.c:42 lttng#5 0x00000000004bc24f in lttng_notification_serialize (notification=0x7f1cb961c310, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/notification.c:63 lttng#6 0x0000000000458b7d in notification_client_list_send_evaluation (client_list=0x7f1cb0008f90, trigger=0x7f1ca40113d0, evaluation=<optimized out>, source_object_creds=0x7f1cb000a874, client_report=0x475840 <client_handle_transmission_status>, user_data=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/notification-thread-events.c:4379 lttng#7 0x0000000000476586 in action_executor_generic_handler (item=0x7f1cb0009600, work_item=0x7f1cb000a820, executor=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:696 lttng#8 action_work_item_execute (work_item=0x7f1cb000a820, executor=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:715 lttng#9 action_executor_thread (_data=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:797 lttng#10 0x0000000000462327 in launch_thread (data=0x7f1cb00060b0) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/thread.c:66 lttng#11 0x0000003123408ea4 in start_thread (arg=<optimized out>) at /usr/src/debug/glibc/2.31+gitAUTOINC+f84949f1c4-r0/git/nptl/pthread_create.c:477 lttng#12 0x00000031230f8dcf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 This can be easily reproduced with the following session and trigger configuration: lttng create test lttng enable-event -u -a lttng start # Register two similar triggers via a dummy C program since rotation # completed condition is not exposed on the CLI for now. Yielding the # following triggers: lttng list-triggers - name: trigger0 owner uid: 1000 condition: session rotation completed session name: test errors: none action:notify errors: none - name: trigger1 owner uid: 1000 condition: session rotation completed session name: test errors: none action:notify errors: none lttng rotate <- abort happens here. Cause ===== The problem lies in how the location (`lttng_trace_archive_location`) object is assigned to the `lttng_evaluation` objects. A single location object can end up being shared between multiple `lttng_evaluation` objects since we iterate over all triggers and create an `lttng_evaluation` object with the location each time as needed. See `src/bin/lttng-sessiond/notification-thread-events.c:1956`. The location object is then freed when the first notification is completely serialized. The second serialization end up having a reference to a freed `lttng_trace_archive_location` object. Solution ======== Implement ref counting for the lttng_trace_archive_location object. Note ======= This also fixes a leak that was present in `cmd_destroy_session_reply`. The location is created by `session_get_trace_archive_location` and is never `destroyed`/`put`. Known drawbacks ========= None. References ========== [1] https://bugs.lttng.org/issues/1325 Fixes: #1325 Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Change-Id: I99dc595ee5b0288c727b193ed061f5273752bd24 Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Observed issue ============== A segmentation fault is observed for multiple UST timeout scenarios. Backtrace: #0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384 #1 0x0000557fe0395df9 in copy_filter_bytecode (orig_f=0x7f9c5802b790) at ust-app.c:1196 #2 0x0000557fe0397702 in shadow_copy_event (ua_event=0x7f9c58025ff0, uevent=0x7f9c58033560) at ust-app.c:1824 #3 0x0000557fe039ac46 in create_ust_app_event (ua_sess=0x7f9c5802ec20, ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, app=0x7f9c5c001da0) at ust-app.c:3192 #4 0x0000557fe03a054d in ust_app_channel_synchronize_event (ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, ua_sess=0x7f9c5802ec20, app=0x7f9c5c001da0) at ust-app.c:5096 lttng#5 0x0000557fe03a0772 in ust_app_synchronize (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5173 lttng#6 0x0000557fe03a0a70 in ust_app_global_update (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5255 lttng#7 0x0000557fe03a00e0 in ust_app_start_trace_all (usess=0x7f9c580074a0) at ust-app.c:4987 lttng#8 0x0000557fe0355c6a in cmd_start_trace (session=0x7f9c5800a190) at cmd.c:2668 lttng#9 0x0000557fe0382e70 in process_client_msg (cmd_ctx=0x7f9c58003d70, sock=0x7f9c74bf44e0, sock_error=0x7f9c74bf44e4) at client.c:1527 lttng#10 0x0000557fe03848a2 in thread_manage_clients (data=0x557fe06d9440) at client.c:2200 lttng#11 0x0000557fe037d1cb in launch_thread (data=0x557fe06d94b0) at thread.c:75 lttng#12 0x00007f9c796af609 in start_thread (arg=<optimized out>) at pthread_create.c:477 lttng#13 0x00007f9c795b6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 The scenario: # Start an instrumented app ./app gdb lttng-sessiond # put a breakpoint on ustctl_set_filter lttng create my_session lttng enable-event -u tp:tp_test lttng start lttng enable-event -u __dummy --filter 'my_field == "user34"' # The tracepoint should hit. Do not continue. kill -s SIGSTOP $(pgrep app) # Continue lttng-sessiond. # enable-event will return an error. This a bug in itself, still let's # continue with the current bug. lttng stop # Start a new app that will register. ./app & sleep 1 lttng start # lttng-sessiond should segfault. Cause ===== During the "lttng enable-event" command, the timeout error bubbles up all the way to event_ust_enable_tracepoint and is different from LTTNG_UST_ERR_EXIST. `trace_ust_destroy_event` is called and frees the `uevent` object. Note that contrary to the comment `uevent` is added to the channel event hash table at this point. On the next `lttng start` command, the event node is still present in the hash table and is iterated on. lttng-sessiond segfault on the first data access of the previously freed memory. The problem was introduced by commit 88e3c2f [1]. Which essentially move the callsite of `add_unique_ust_event` before `ust_app_*_event_glb` calls. Solution ======== Go to `end` label to prevent freeing of the uevent object. Note that app synchronization should not force an error at the channel level, since a single app can fail but the whole channel should not. The `error` label is now obsolete. Known drawbacks ========= None. References ========== [1] lttng@88e3c2f Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Change-Id: Ifaf3f4c71bb2da869c7b441aaa4b367f8f7cbdd6 Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Observed issue ============== The following scenario lead to the UST thread to be "stuck" on recvmsg on the notify socket. The problem manifest itself when an application is unresponsive during the ustctl_start_session call. Note that the default timeout for ust communication is 5 seconds. # Start an instrumented app ./app gdb lttng-sessiond # put a breakpoint on ustctl_start_session lttng create my_session lttng enable-event -u -a lttng start # The tracepoint should hit. Do not continue. kill -s SIGSTOP $(pgrep app) # Continue lttng-sessiond. sleep 5 # This make sure lttng-sessiond unregister the app from its point of view kill -s SIGCONT $(pgrep app) gdb -p $(pgrep app) thread apply all bt App stack trace: Thread 3 (Thread 0x7fe2c6f58700 (LWP 48172)): #0 __libc_recvmsg (flags=0, msg=0x7fe2c6f56ac0, fd=4) at ../sysdeps/unix/sysv/linux/recvmsg.c:28 #1 __libc_recvmsg (fd=fd@entry=4, msg=msg@entry=0x7fe2c6f56ac0, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:25 #2 0x00007fe2c7a010ba in ustcomm_recv_unix_sock (sock=sock@entry=4, buf=buf@entry=0x7fe2c6f56ea0, len=len@entry=48) at lttng-ust-comm.c:308 #3 0x00007fe2c7a037c3 in ustcomm_register_channel (sock=4, session=session@entry=0x7fe2c0000ba0, session_objd=<optimized out>, channel_objd=<optimized out>, nr_ctx_fields=nr_ctx_fields@entry=0, ctx_fields=<optimized out>, chan_id=0x7fe2 c6f5716c, header_type=0x7fe2c0012b18) at lttng-ust-comm.c:1544 #4 0x00007fe2c7a10787 in lttng_session_enable (session=0x7fe2c0000ba0) at lttng-events.c:444 lttng#5 0x00007fe2c7a0b785 in lttng_session_cmd (objd=1, cmd=128, arg=140611977311672, uargs=0x7fe2c6f57800, owner=0x7fe2c7a5da00 <local_apps>) at lttng-ust-abi.c:576 lttng#6 0x00007fe2c7a07d6d in handle_message (lum=0x7fe2c6f57590, sock=3, sock_info=0x7fe2c7a5da00 <local_apps>) at lttng-ust-comm.c:1003 lttng#7 ust_listener_thread (arg=0x7fe2c7a5da00 <local_apps>) at lttng-ust-comm.c:1712 lttng#8 0x00007fe2c7993609 in start_thread (arg=<optimized out>) at pthread_create.c:477 lttng#9 0x00007fe2c78ba293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 ... Cause ===== When the app continues after the timeout from lttng-sessiond side, the actual start_session message is received on the application side then UST, app side, send commands on the notify socket. On lttng-sessiond side, the command is received but no reply is sent. This is due to the fact that the lookup against the ust_app_ht_by_notify_sock hash table (find_app_by_notify_sock) return nothing since the app is unregistered at this point and the hash table node was removed on unregistration. Solution ======== When the app lookup fails, return an error that will trigger the cleanup of the notify socket. Known drawbacks ========= None Note ========= Subsequent error path in reply_ust_register_channel, add_event_ust_registry, and add_enum_ust_registry might lead to the same type of problem since no reply is sent to the app. Still, for those cases the complete application/notify socket should not be destroyed since the error path relate to either a session or a sub object of a session. Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Change-Id: Iea0dc027ca1ee772e84c7e545114f1be69fd1f63 Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>